Multilayer neural networks with extensively many hidden units

The information processing abilities of a multilayer neural network with a number of hidden units
scaling as the input dimension are studied using statistical mechanics methods. The mapping from
the input layer to the hidden units is performed by general symmetric Boolean functions, whereas
the hidden layer is connected to the output by either discrete or continuous couplings. Introducing
an overlap in the space of Boolean functions as order parameter, the storage capacity is found to scale
with the logarithm of the number of implementable Boolean functions. The generalization behaviour
is smooth for continuous couplings and shows a discontinuous transition to perfect generalization
for discrete ones.



Statistical mechanics investigations of artificial neural networks continue to play a
stimulating and integrating role in the scientific dialogue between disciplines as diverse as
neurophysiology, mathematical statistics, computer science and information theory. In
particular, the study of feed-forward neural networks pioneered by Gardner [1] has revealed a
variety of interesting results on how these systems may learn different information-processing
tasks from examples (for a review see [2]). Of particular importance in this respect are
multilayer networks (MLN) because of their ability to implement any function between input
and output [3], which makes them attractive candidates for many practical applications. It
is well known that very many hidden units are needed in order to realize this vast
computational complexity. However, statistical mechanics studies of MLN have so far been
mostly restricted to systems with very few hidden units as compared to the number of
inputs [4]. In the present letter we overcome this limitation and study the storage and
generalization abilities of a tree MLN in which the size of the hidden layer scales in the
same way as the input dimension.

We consider a MLN with N binary hidden units $\tau_i = \pm 1$, $i = 1, \ldots, N$, feeding a
binary output $\sigma = \mathrm{sgn}(\sum_i J_i \tau_i)$ through a coupling vector
$J = (J_1, \ldots, J_N)$. The hidden units are determined via Boolean functions
$\tau_i = B_i(\xi_i)$ by disjoint sets of inputs $\xi_i = (\xi_{i1}, \ldots, \xi_{iL})$
containing L elements each. We are interested in the limit $N \to \infty$ with L remaining
constant.

In order to keep the connection with neural network architectures we restrict ourselves to
symmetric Boolean functions characterized by $B_i(-\xi_i) = -B_i(\xi_i)$. There are
$2^{2^{L-1}}$ such functions with L inputs, with only a few of them realizable by a coupling
vector $w_i$ according to $B_i(\xi_i) = \mathrm{sgn}(\sum_j w_{ij} \xi_{ij})$. For L = 3
there are, e.g., 16 symmetric Boolean functions but only 14 of them are linearly separable.
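The two counts quoted for L = 3 can be checked by brute force. The sketch below is an illustration, not part of the original analysis: it tabulates each symmetric function on the four sign patterns whose first component is +1, and collects every function of the form sgn(sum_j w_j xi_j) for small integer weights. The weight range -2..2 is an assumption that happens to suffice for L = 3; weight vectors that produce a zero field on some pattern are skipped.

```python
import itertools

L = 3

# Sign patterns with first component +1. By the symmetry B(-xi) = -B(xi),
# a symmetric function is fixed by its values on these 2**(L-1) patterns.
half_space = [(1,) + rest for rest in itertools.product((1, -1), repeat=L - 1)]

# Every assignment of +-1 to those patterns is one symmetric Boolean function.
symmetric = set(itertools.product((-1, 1), repeat=len(half_space)))

# Functions realizable as sgn(w . xi) for integer weights in -2..2 (assumed range).
separable = set()
for w in itertools.product(range(-2, 3), repeat=L):
    fields = [sum(wj * sj for wj, sj in zip(w, xi)) for xi in half_space]
    if all(h != 0 for h in fields):  # skip weight vectors with ties
        separable.add(tuple(1 if h > 0 else -1 for h in fields))

print(len(symmetric), len(separable))  # prints: 16 14
```

The two functions left out are the parity function xi_1 xi_2 xi_3 and its negative, which no coupling vector can realize.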

In order to investigate the storage and generalization properties of the network we consider
a set of $\alpha L N$ inputs $\xi^\mu$, $\mu = 1, \ldots, \alpha L N$, the components
$\xi^\mu_{ij}$ of which are independent, identically distributed random variables with zero
mean and unit variance. We then ask for the ability of the network to map these inputs on
outputs $\sigma^\mu = 1$ for all $\mu$ by adapting the Boolean functions $B_i$ and the
couplings $J_i$ appropriately.

The central quantity in the statistical mechanics analysis is the quenched entropy

where $d\mu(J)$ is the proper measure in the space of couplings J, the trace denotes the sum
over all Boolean functions, the product is non-zero only if the arguments of all of the
$\theta$-functions are positive, and the double angle brackets stand for the average over the
inputs. The determination of s can be performed using the replica trick and introducing the
overlap between two solutions in the combined space of couplings J and Boolean functions
$B_i$ of the form

with the average being now over a single, L-component vector $\xi$. Exploiting the fact that
this average involves a finite number of terms only and assuming replica symmetry,
$q^{ab} = q$ for $a \neq b$, we can write s using standard techniques in the form

$$ s = \mathop{\rm extr}_{q,\hat{q}} \left[ G_C(q,\hat{q}) + G_S(\hat{q}) + \alpha L\, G_E(q) \right] \qquad (3) $$

with the explicit expressions for the functions $G_C$, $G_S$ and $G_E$ depending on L, the
constraints on J and on whether the storage or the generalization problem is addressed.
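For orientation, one way to write an entropy and an overlap that match the verbal description above is sketched here. This is an assumed form in the spirit of the standard Gardner calculation, not a quotation: the prefactors 1/N and 1/sqrt(N), the restriction to target outputs equal to +1, and the notation for the single-vector average over xi are all assumptions.

```latex
% Assumed form of the quenched entropy (normalization is an assumption)
s = \lim_{N\to\infty} \frac{1}{N}
    \Big\langle\!\Big\langle \ln \int d\mu(J)\, \mathop{\rm Tr}_{\{B_i\}}
    \prod_{\mu=1}^{\alpha L N}
    \theta\Big( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} J_i\, B_i(\xi_i^\mu) \Big)
    \Big\rangle\!\Big\rangle

% Assumed form of the overlap between two solutions a and b
q^{ab} = \frac{1}{N} \sum_{i=1}^{N} J_i^a J_i^b\,
         \big\langle B_i^a(\xi)\, B_i^b(\xi) \big\rangle_{\xi}
```

With an overlap of this kind, two solutions sharing the same couplings can still differ through their Boolean functions, which is why the average over a single L-component vector enters the order parameter.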

Let us begin with the storage problem by asking for the storage capacity $\alpha_c$, defined
as the maximal value of $\alpha$ for which the system can still realize all desired
input-output mappings with probability 1. Performing the replica limit with the number of
replicas tending to zero, characteristic for this problem, we find

$$ G_E(q) = \int Dt\, \ln H(Qt) \qquad (4) $$

with the abbreviations $Dt = dt\, e^{-t^2/2}/\sqrt{2\pi}$, $H(x) = \int_x^\infty Dt$, and
$Q = \sqrt{q/(1-q)}$. The expressions for $G_C$ and $G_S$ depend on the constraints on the
coupling vector J.
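The energetic term (4) is a one-dimensional Gaussian integral and is easy to evaluate numerically. The following is a minimal sketch under stated assumptions: Q is taken as sqrt(q/(1-q)), the standard Gardner form; the integral is approximated by a trapezoidal rule on [-8, 8]; and ln H(x) is replaced by its leading asymptotic form for x > 30, where erfc would underflow. The function names are illustrative.

```python
import math

def log_H(x):
    """ln H(x), with H(x) = int_x^inf Dt = erfc(x / sqrt(2)) / 2."""
    if x > 30.0:
        # Leading asymptotics, used where erfc would underflow to zero.
        return -0.5 * x * x - math.log(x * math.sqrt(2.0 * math.pi))
    return math.log(0.5 * math.erfc(x / math.sqrt(2.0)))

def G_E(q, t_max=8.0, steps=16000):
    """Trapezoidal estimate of int Dt ln H(Q t), with Q = sqrt(q / (1 - q))."""
    Q = math.sqrt(q / (1.0 - q))
    h = 2.0 * t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = -t_max + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end points
        gauss = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        total += weight * gauss * log_H(Q * t)
    return total * h

print(G_E(0.0))            # close to -ln 2: every pattern halves the solution space
print(G_E(400.0 / 401.0))  # Q = 20: close to -Q**2 / 4 up to logarithmic corrections
```

At q = 0 the integrand is the constant ln(1/2). For large Q the positive-t half of the integral dominates with ln H(Qt) close to -Q^2 t^2 / 2, so G_E approaches -Q^2/4.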

A particularly simple case is given by Ising couplings $J_i = \pm 1$. From the symmetry of
the Boolean functions considered it is clear that it is sufficient to consider $J_i = 1$ for
all i. Consequently, in this case all flexibility of the network rests in the choice of the
Boolean functions between input and hidden layer, and q is solely an overlap in the space of
these Booleans. We find $G_C = -\hat{q}(1-q)/2$, where $\hat{q}$ denotes the conjugate order
parameter to q. Moreover, in the case where all $2^{2^{L-1}}$ symmetric Boolean functions
are admissible we use the identity

with the sums and products over $\xi$ running over all $2^{L-1}$ configurations of $\xi$
with $\xi_1 = 1$ to find

Under the transformations $\hat{q} \mapsto 2^{L-1}\hat{q}$ and $\alpha \mapsto 2^{L-1}\alpha/L$
the resulting expression for the entropy maps exactly on the result for the Ising perceptron
corresponding to L = 1, and we may therefore use the well-known results for this case.
Accordingly, the storage capacity is overestimated by the replica symmetric expression and
the correct result

$$ \alpha_c(L) = \alpha_c(1)\, 2^{L-1}/L \approx 0.83 \cdot 2^{L-1}/L \qquad (6) $$

is given by the value of $\alpha$ at which the entropy $s(\alpha)$ turns negative. The
storage capacity is hence proportional to the logarithm of the number of implementable
Boolean functions. This result is also in accordance with the rigorous upper bound
$\alpha_c \le 2^{L-1}/L$ resulting from the annealed entropy
$s^{\rm ann} = (2^{L-1} - \alpha L) \ln 2$. As in the case of the Ising perceptron, this
bound is related to information theory. The full specification of the network with all
$J_i = 1$ requires $N\, 2^{L-1}$ bits of information, necessary to pin down the N Boolean
functions $B_i$. Therefore the machine cannot store more than $N\, 2^{L-1}$ bits and
$\alpha_c$ cannot exceed $2^{L-1}/L$.
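The scaling of the capacity and of the information-theoretic bound can be tabulated directly. The sketch below assumes that the capacity reads alpha_c(L) = 0.83 * 2^(L-1) / L, which reproduces the value 1.11 quoted for L = 3, and that the bound is 2^(L-1) / L; the function names are illustrative.

```python
ALPHA_C_ISING = 0.83  # Ising perceptron capacity alpha_c(1) quoted in the text

def capacity(L):
    """alpha_c(L) = alpha_c(1) * 2**(L-1) / L (Ising couplings, all symmetric functions)."""
    return ALPHA_C_ISING * 2 ** (L - 1) / L

def information_bound(L):
    """Bound 2**(L-1) / L: N * 2**(L-1) bits shared among alpha * L * N patterns."""
    return 2 ** (L - 1) / L

for L in range(1, 7):
    print(L, round(capacity(L), 2), round(information_bound(L), 2))
```

For every L the capacity stays at the same fraction, 0.83, of the bound, since both carry the factor 2^(L-1) / L.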

Fig. 1 compares the analytical result $\alpha_c(3) = 1.11$ for L = 3 with numerical
simulations using exact enumerations. Even for the small sizes accessible to this numerical
technique we find a steepening of the transition with increasing N and a crossing point of
the curves near the theoretical prediction.

If the trace over the Boolean functions in (1) is restricted to those which can be realized
by perceptrons with coupling vectors $w_i$, the exact mapping on the Ising perceptron no
longer holds. Solving the corresponding extremum conditions numerically for L = 3 we find
$\alpha_c = 1.06$ for this case. The reduction of $\alpha_c$ compared to the unrestricted
case is roughly the same as the reduction in the logarithm of the number of admissible
Boolean functions, $1.06/1.11 \approx \ln(14)/\ln(16)$.

FIG. 1. Fraction f of $3\alpha N$ random input-output mappings implementable by a MLN with
3N inputs and N hidden units as function of $\alpha$ for N = 3 (squares) and N = 5 (circles).
The couplings between hidden units and output are fixed to $J_i = 1$ for all i and
enumerations are performed over all combinations of symmetric Boolean functions $B_i$ between
input and hidden layer. For every value of $\alpha$, 200 realizations of Gaussian inputs
were averaged over. The solid line gives the analytical result describing the limit
$N \to \infty$.
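An exact-enumeration experiment of the kind summarized in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: inputs are drawn as random signs rather than Gaussians (a Boolean function only sees the sign pattern), all target outputs are +1, the couplings are fixed to J_i = 1, and all function and parameter names are hypothetical.

```python
import itertools
import random

def symmetric_tables(L):
    """All symmetric Boolean functions of L inputs, each stored as its values on
    the 2**(L-1) sign patterns whose first component is +1."""
    return list(itertools.product((-1, 1), repeat=2 ** (L - 1)))

def apply_function(table, xi):
    """Evaluate a symmetric function on a +-1 tuple xi, using B(-xi) = -B(xi)
    to reduce xi to the pattern with first component +1."""
    flip = xi[0]
    index = 0
    for s in xi[1:]:
        index = 2 * index + (1 if s * flip < 0 else 0)
    return flip * table[index]

def storable(patterns, tables):
    """patterns[mu][i] is the L-tuple fed to hidden unit i. True if some choice of
    B_1..B_N gives sum_i B_i(xi_i) > 0 on every pattern."""
    n_hidden = len(patterns[0])
    # tau[i][f] = outputs of function f at hidden unit i on all patterns
    tau = [[tuple(apply_function(t, p[i]) for p in patterns) for t in tables]
           for i in range(n_hidden)]
    for choice in itertools.product(*tau):
        if all(sum(column) > 0 for column in zip(*choice)):
            return True
    return False

def fraction_storable(alpha, n_hidden, L=3, samples=200, seed=0):
    """Fraction of random pattern sets of size round(alpha * L * N) that can be stored."""
    rng = random.Random(seed)
    tables = symmetric_tables(L)
    n_patterns = max(1, round(alpha * L * n_hidden))
    hits = 0
    for _ in range(samples):
        patterns = [[tuple(rng.choice((-1, 1)) for _ in range(L))
                     for _ in range(n_hidden)]
                    for _ in range(n_patterns)]
        hits += storable(patterns, tables)
    return hits / samples

if __name__ == "__main__":
    for alpha in (0.5, 1.0, 1.5, 2.0):
        print(alpha, fraction_storable(alpha, 3, samples=20))
```

The enumeration visits 16^N combinations of functions, which is why only very small N are accessible. Choosing N odd avoids a zero field at the output.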

It is possible to generalize the above analysis to the case of discrete couplings with
finite synaptic depth l of the form $J_i = \pm 1/l, \pm 2/l, \ldots, \pm 1$ by building on
the analysis of the analogous case for the perceptron [7,8]. In this case the additional
order parameter $\tilde{q} = \sum_i (J_i)^2/N$ and its conjugate $\hat{\tilde{q}}$ have to
be introduced. For $G_E$ we then again find (4), now with $Q = \sqrt{q/(\tilde{q} - q)}$.
Moreover, $G_C = -\tilde{q}\hat{\tilde{q}} + q\hat{q}/2$ and, if all symmetric Boolean
functions are admissible,

with ${\rm Tr}_J$ denoting the trace over the 2l possible values of the couplings $J_i$.
Using these results we have numerically calculated the storage capacity $\alpha_c(l)$ for
the simplest case L = 3 as a function of the synaptic depth l. The results are shown in
Fig. 2 together with a fit to the asymptotic behavior. The capacity increases from
$\alpha_c \approx 1.11$ for the Ising case, l = 1, to roughly 1.7 for large l. It is rather
difficult to compare these analytical findings with numerical simulations since the effects
of the finite synaptic depth do not show up at the small values of N accessible to exact
enumerations.

FIG. 2. Storage capacity of a MLN with $N \to \infty$ hidden units and 3N inputs with
couplings $J_i$ between hidden layer and output taking 2l discrete values. The inputs are
mapped to the hidden layer by symmetric Boolean functions $B_i$. The solid line is the fit
$\alpha_c \approx 1.70 - 0.91/l$ to the asymptotic behavior, the dashed line gives the
replica symmetric result for continuous couplings $J_i$.

To complete the analysis of the storage properties we analyze the case of continuous
couplings J between hidden and output layer. It is convenient to eliminate the additional
order parameter k, necessary in this case to enforce the normalization $\sum_i J_i^2 = N$,
by introducing $\hat{Q} = \hat{q}/(k + \hat{q})$. Within replica symmetry the quenched
entropy s is then again of the form (3) with $G_C = 0$, $G_E$ given by (4), and the extremum
taken now over Q and $\hat{Q}$. Moreover

The storage capacity $\alpha_c$ can be obtained from these expressions in the limits
$Q \to \infty$, $\hat{Q} \to \infty$ corresponding to $q \to 1$. This limit indicates that
different solutions of the storage problem may at most differ in a non-extensive number of
components $J_i$ and Boolean functions $B_i$. We then find $G_E \sim -Q^2/4$ and, if all
Boolean functions are admissible, $G_S \sim \hat{Q}\,(1/2 + (2^{L-1} - 1)/\pi)$, giving rise
to

$$ \alpha_c^{RS} = \frac{2}{L} + \frac{4}{\pi L}\left(2^{L-1} - 1\right). \qquad (8) $$

For L = 3 this yields $\alpha_c^{RS} = 2/3 + 4/\pi = 1.94$. If only linearly separable
Boolean functions implementable by coupling vectors $w_i$ are considered, the asymptotic
behavior of $G_S$ is more difficult to obtain. For the case L = 3 we find $\alpha_c = 1.85$.
Again the relative reduction of $\alpha_c$ when compared to the unrestricted case is roughly
given by the ratio of the logarithms of the number of available Boolean functions per hidden
unit.
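The numbers in this paragraph can be checked with a few lines of arithmetic. The sketch assumes that the replica symmetric capacity has the closed form 2/L + 4 (2^(L-1) - 1) / (pi L), chosen because it gives 2/3 + 4/pi for L = 3 and the perceptron value 2 for L = 1; the function name is illustrative.

```python
import math

def alpha_c_rs(L):
    """Assumed replica symmetric capacity 2/L + 4*(2**(L-1) - 1)/(pi*L)
    for continuous couplings and all symmetric Boolean functions."""
    return 2.0 / L + 4.0 * (2 ** (L - 1) - 1) / (math.pi * L)

print(round(alpha_c_rs(1), 2))  # 2.0, the perceptron value
print(round(alpha_c_rs(3), 2))  # 1.94, i.e. 2/3 + 4/pi

# Ratio of the logarithms of the numbers of admissible functions for L = 3,
# compared with the two capacity ratios quoted in the text.
log_ratio = math.log(14) / math.log(16)
print(round(log_ratio, 3), round(1.06 / 1.11, 3), round(1.85 / 1.94, 3))
```

Both capacity ratios, for Ising and for continuous couplings, lie within one percent of ln(14)/ln(16), which is the sense in which the reduction is "roughly" logarithmic.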

It is possible to derive an upper bound for $\alpha_c$, as has been done for MLN with a
finite number of hidden units, by using some exact results for the perceptron. For L = 3 we
find $\alpha_c(L = 3) \le 2.394$ and the replica symmetric result is therefore within the
bound. For large L the bound is given by $\alpha_c(L \to \infty) \le 2^{L-1}/L + \ln 2$ and
shows the same scaling with L as (8). Nevertheless the replica symmetric result (8) is very
likely to overestimate the storage capacity, as can be seen from Fig. 2, in which the result
for $\alpha_c$ for L = 3 is included as a horizontal dashed line. Unlike the case of the
perceptron, the values of $\alpha_c$ for finite synaptic depth do not seem to approach the
value for continuous couplings when $l \to \infty$. It would hence be very interesting to
investigate the implications of replica symmetry breaking, both for the case of continuous
couplings and for couplings with finite synaptic depth.

Let us finally elucidate the generalization problem, i.e. the ability of the network to
infer a rule from examples. To this end we consider, as usual, two networks of the same type
with the couplings and Boolean functions of one of them (the "teacher") fixed at random. The
other network (the "student") receives a set of randomly chosen inputs $\xi^\mu$,
$\mu = 1, \ldots, \alpha L N$, together with the corresponding outputs $\sigma^\mu$ generated
by the teacher. The task for the student is to imitate the teacher as well as possible. The
success in doing so is quantified by the generalization error $\varepsilon$, defined as the
probability that a newly chosen random input is classified differently by teacher and
student.

As is well known, the statistical mechanics analysis of the generalization problem builds
again on the expression (1) for the quenched entropy, with the number of replicas now
tending to 1 rather than to 0. A nice feature of this limit is that replica symmetry is
known to be stable. The order parameter q defined in (2) now gives the typical overlap
between teacher and student and determines the generalization error $\varepsilon$ in a
simple way. In the present situation we have the standard relation
$\varepsilon = (\arccos q)/\pi$. Moreover, (4) is replaced by

$$ G_E(q) = 2 \int Dt\, H(Qt) \ln H(Qt). \qquad (9) $$

The case using Ising couplings $J_i = \pm 1$ and all symmetric Boolean functions can again
be mapped exactly on the Ising perceptron. Correspondingly, there is a discontinuous
transition to perfect learning, $\varepsilon = 0$ for $\alpha > \alpha_d$, with
$\alpha_d = 1.24 \cdot 2^{L-1}/L$, where 1.24 is the Ising perceptron value and the factor
$2^{L-1}/L$ follows from the same rescaling of $\alpha$ as in (6). This transition occurs
when all Boolean functions of the student "lock" onto the corresponding input-hidden
mappings of the teacher and is also expected to occur in the case where only a restricted
set of Booleans can be implemented.

For continuous couplings we find $G_C = Q^2\hat{Q}/((1 + Q^2)(1 - \hat{Q}))$ and

$$ G_S = \frac{1}{2}\ln(1 - \hat{Q}) + \int \prod_\xi Dz_\xi\; g(z_\xi) \ln g(z_\xi) \qquad (10) $$

where

$$ g(z_\xi) = {\rm Tr}_B \exp\Big( \frac{\hat{Q}}{2^L} \Big( \sum_\xi z_\xi B(\xi) \Big)^2 \Big). \qquad (11) $$

For small $\alpha$ this gives rise to $\varepsilon \approx 1/2 - \alpha L/(\pi^2 2^{L-2})$,
which coincides with the result for the perceptron for L = 1, as it should. With increasing
L the initial decay of the generalization error becomes slower, reflecting the increasing
complexity and storage abilities of the network. There is no retarded learning because of
the non-zero correlation between the hidden units and the output. For large $\alpha$ the
generalization behaviour is dominated by the fine tuning of the student couplings between
hidden layer and output to the respective couplings of the teacher, resulting in the
ubiquitous power law decay $\varepsilon \sim 0.625/(L\alpha)$.
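The two limiting forms of the generalization error can be compared numerically. The sketch below rests on assumptions: the small-alpha expansion is taken as 1/2 - alpha L / (pi^2 2^(L-2)), the reading that reproduces the perceptron result 1/2 - 2 alpha / pi^2 at L = 1, and the large-alpha decay is taken as 0.625 / (L alpha). The function names are illustrative.

```python
import math

def generalization_error(q):
    """epsilon = arccos(q) / pi for a teacher-student overlap q."""
    return math.acos(q) / math.pi

def small_alpha_error(alpha, L):
    """Assumed small-alpha expansion 1/2 - alpha * L / (pi**2 * 2**(L-2))."""
    return 0.5 - alpha * L / (math.pi ** 2 * 2 ** (L - 2))

def large_alpha_error(alpha, L):
    """Assumed large-alpha power law 0.625 / (L * alpha)."""
    return 0.625 / (L * alpha)

for L in (1, 3, 5):
    print(L, round(small_alpha_error(0.1, L), 4), round(large_alpha_error(50.0, L), 5))
```

The initial slope is proportional to L / 2^(L-2), which does not grow with L, so larger L learns more slowly at first. The late-stage decay, by contrast, improves with L because alpha L N examples are available to tune only N couplings.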

In conclusion we have quantitatively characterized the storage and generalization abilities
of a multilayer neural network with a number of hidden units scaling as the input dimension.
If the mapping from the input to the hidden layer is realized by symmetric Boolean functions
with L inputs, the capacity is found to be proportional to the logarithm of the number of
these Boolean functions divided by L. The more conventional case in which the hidden units
are the outputs of perceptrons with couplings $w_i$ is more difficult to analyze. However,
speculating that the above scaling holds true also in this case, and observing that the
logarithm of the number of Boolean functions which can be implemented by a perceptron with L
inputs is $O(L^2)$, we arrive at the interesting result that the number of stored
input-output relations per weight of the network is proportional to L. This implies that
doubling the number of couplings in the network would increase the storage capacity by a
factor of 2, making the proposed architecture superior to MLN with few ($K \ll N$) hidden
units, in which the storage capacity is known to increase at most logarithmically with the
number of weights.